SSAST: Self-Supervised Audio Spectrogram Transformer
Abstract
Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-art results on various audio classification benchmarks. However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining that requires a large amount of labeled data and a complex training pipeline, thus limiting the practical usage of AST. This paper focuses on audio and speech classification, and aims to reduce the need for large amounts of labeled data for AST by leveraging self-supervised learning using unlabeled data. Specifically, we propose to pretrain the AST model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio from AudioSet and Librispeech. We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification. The proposed self-supervised framework significantly boosts AST performance on all tasks, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST. To the best of our knowledge, it is the first patch-based self-supervised learning framework in the audio and speech domain.
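The abstract names joint discriminative and generative masked spectrogram patch modeling but gives no implementation detail, so the following is only a minimal NumPy sketch of the general idea, not the authors' code. The 16x16 patch size, the number of masked patches, zeroing as the mask, the random linear map standing in for the Transformer, the InfoNCE form of the discriminative loss, and the loss weight are all assumptions made for illustration.

```python
import numpy as np

def split_into_patches(spec, patch=16):
    """Cut a (freq, time) spectrogram into flattened non-overlapping patches."""
    f, t = spec.shape
    f, t = f - f % patch, t - t % patch          # drop ragged edges
    blocks = spec[:f, :t].reshape(f // patch, patch, t // patch, patch)
    return blocks.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def mspm_losses(patches, predicted, masked_idx):
    """Return (discriminative, generative) losses over the masked patches.

    patches:    (N, D) original patches
    predicted:  (N, D) model outputs, one vector per patch position
    masked_idx: indices of the patches that were hidden from the model
    """
    target = patches[masked_idx]                 # (M, D)
    pred = predicted[masked_idx]                 # (M, D)
    # Generative objective: mean squared error against the true patch.
    generative = float(np.mean((pred - target) ** 2))
    # Discriminative objective (assumed InfoNCE form): each prediction
    # must pick its own patch out of all masked patches.
    logits = pred @ target.T                     # (M, M)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    discriminative = float(-np.mean(np.diag(log_probs)))
    return discriminative, generative

rng = np.random.default_rng(0)
spec = rng.standard_normal((128, 1024))          # stand-in for a log-mel spectrogram
patches = split_into_patches(spec)               # shape (512, 256)
masked_idx = rng.choice(len(patches), size=100, replace=False)

corrupted = patches.copy()
corrupted[masked_idx] = 0.0                      # assumption: zeros replace a learned mask embedding

# Hypothetical placeholder model: a random per-patch linear map.
# A real system would use a Transformer that sees the unmasked context.
weights = rng.standard_normal((256, 256)) * 0.01
predicted = corrupted @ weights

d_loss, g_loss = mspm_losses(patches, predicted, masked_idx)
total = d_loss + 10.0 * g_loss                   # hypothetical weighting of the two objectives
```

Because the placeholder model has no access to context, its predictions for masked patches carry no information and the discriminative loss sits at chance level; a trained Transformer encoder would be expected to drive both losses down by using the surrounding unmasked patches.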
Related resources
Audio Chord Estimation Using Chroma Reduced Spectrogram and Self-similarity
In this paper we describe a method of audio chord estimation that does not rely on any machine learning technique. We calculate a beat-synchronized spectrogram with high time and frequency resolution. The sequence of chroma vectors (CRP features based on constant-Q transform) obtained from the spectrogram is smoothed using a self-similarity matrix before the actual chord recognition. Binary chord tem...
CalibNet: Self-Supervised Extrinsic Calibration using 3D Spatial Transformer Networks
3D LiDARs and 2D cameras are increasingly being used alongside each other in sensor rigs for perception tasks. Before these sensors can be used to gather meaningful data, however, their extrinsics (and intrinsics) need to be accurately calibrated, as the performance of the sensor rig is extremely sensitive to these calibration parameters. A vast majority of existing calibration techniques requi...
Universität Augsburg Audio Brush: Editing Audio in the Spectrogram
A tool for editing audio signals in the spectrogram is presented. It allows manipulating the spectrogram of a signal directly at any chosen time-frequency resolution and reconstructing the edited signal in HiFi quality – a capability that is usually not possible with the Fourier or wavelet transformation. Image processing and computer vision methods are applied to the spectrogram in order to id...
Universität Augsburg Audio Brush: Smart Audio Editing in the Spectrogram
Starting with a novel audio analysis and editing paradigm, a set of new and adaptive audio analysis and editing algorithms in the spectrogram are developed and integrated into a smart visual audio editing tool in a “what you see is what you hear” style. At the core of our algorithms and methods is a very flexible audio spectrogram that goes beyond FFT and Wavelets and supports manipulating a si...
Supervised Transformer Network for Efficient Face Detection
Large pose variations remain a challenge for real-world face detection. We propose a new cascaded Convolutional Neural Network, dubbed Supervised Transformer Network, to address this challenge. The first stage is a multi-task Region Proposal Network (RPN), which simultaneously predicts candidate face regions along with associated facial landmarks. The candidate regions ...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2022
ISSN: ['2159-5399', '2374-3468']
DOI: https://doi.org/10.1609/aaai.v36i10.21315